Toxic Comment Filter

BiLSTM model for multi-label classification
code
Deep Learning
Python, R
Author

Simone Brazzi

Published

August 12, 2024

1 Introduction

  • Build a model able to filter user comments according to how harmful their language is.
  • Preprocess the text by removing the set of tokens that make no significant semantic contribution.
  • Transform the text corpus into sequences.
  • Build a Deep Learning model with recurrent layers for a multi-label classification task.

At prediction time, the model must return a vector containing a 1 or a 0 for each label in the dataset (toxic, severe_toxic, obscene, threat, insult, identity_hate). This way, a harmless comment is classified by a vector of all 0s [0,0,0,0,0,0]. Conversely, a harmful comment has at least one 1 among the 6 labels.
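As a minimal sketch of the expected output, assuming hypothetical model probabilities and a 0.5 decision threshold (both made up for illustration):

```python
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

proba = [0.91, 0.08, 0.72, 0.02, 0.65, 0.04]  # hypothetical model probabilities
# binarize each label independently: this is a multi-label, not multi-class, task
prediction = [1 if p >= 0.5 else 0 for p in proba]
print(prediction)  # [1, 0, 1, 0, 1, 0]
```

A clean comment would instead yield [0, 0, 0, 0, 0, 0].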

2 Setup

Leveraging Quarto and RStudio, I will set up an R and Python environment.

2.1 Import R libraries

Import R libraries. These will be used both for rendering the document and for data analysis, since I prefer ggplot2 over matplotlib. I will also use colorblind-safe palettes.

Code
library(tidyverse, verbose = FALSE)
library(tidymodels, verbose = FALSE)
library(reticulate)
library(ggplot2)
library(plotly)
library(RColorBrewer)
library(bslib)
library(Metrics)

reticulate::use_virtualenv("r-tf")

2.2 Import Python packages

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import keras
import keras_nlp

from keras.backend import clear_session
from keras.models import Model, load_model
from keras.layers import TextVectorization, Input, Dense, Embedding, Dropout, GlobalAveragePooling1D, LSTM, Bidirectional, GlobalMaxPool1D, Flatten, Attention
from keras.metrics import Precision, Recall, AUC, SensitivityAtSpecificity, SpecificityAtSensitivity, F1Score


from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import multilabel_confusion_matrix, classification_report, ConfusionMatrixDisplay, precision_recall_curve, f1_score, recall_score, roc_auc_score

Create a Config class to store all the useful parameters for the model and for the project.

2.3 Class Config

I created a class holding all the basic configuration of the model, to improve readability.

Code
class Config():
    def __init__(self):
        self.url = "https://s3.eu-west-3.amazonaws.com/profession.ai/datasets/Filter_Toxic_Comments_dataset.csv"
        self.max_tokens = 20000
        self.output_sequence_length = 911 # check the analysis done to establish this value
        self.embedding_dim = 128
        self.batch_size = 32
        self.epochs = 100
        self.temp_split = 0.3
        self.test_split = 0.5
        self.random_state = 42
        self.total_samples = 159571 # total train samples
        self.train_samples = 111699
        self.val_samples = 23936
        self.features = 'comment_text'
        self.labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
        self.new_labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate', "clean"]
        self.label_mapping = {label: i for i, label in enumerate(self.labels)}
        self.new_label_mapping = {label: i for i, label in enumerate(self.new_labels)}  # includes the "clean" label
        self.path = "/Users/simonebrazzi/R/blog/posts/toxic_comment_filter/history/f1score/"
        self.model =  self.path + "model_f1.keras"
        self.checkpoint = self.path + "checkpoint.lstm_model_f1.keras"
        self.history = self.path + "lstm_model_f1.xlsx"
        
        self.metrics = [
            Precision(name='precision'),
            Recall(name='recall'),
            AUC(name='auc', multi_label=True, num_labels=len(self.labels)),
            F1Score(name="f1", average="macro")
            
        ]
    def get_early_stopping(self):
        early_stopping = keras.callbacks.EarlyStopping(
            monitor="val_f1", # "val_recall",
            min_delta=0.2,
            patience=10,
            verbose=0,
            mode="max",
            restore_best_weights=True,
            start_from_epoch=3
        )
        return early_stopping

    def get_model_checkpoint(self, filepath):
        model_checkpoint = keras.callbacks.ModelCheckpoint(
            filepath=filepath,
            monitor="val_f1", # "val_recall",
            verbose=0,
            save_best_only=True,
            save_weights_only=False,
            mode="max",
            save_freq="epoch"
        )
        return model_checkpoint

    def find_optimal_threshold_cv(self, ytrue, yproba, metric, thresholds=np.arange(.05, .35, .05), n_splits=7):
        # instantiate KFold
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
        threshold_scores = []

        for threshold in thresholds:
            cv_scores = []
            for train_index, val_index in kf.split(ytrue):
                ytrue_val = ytrue[val_index]
                yproba_val = yproba[val_index]

                # binarize the probabilities at the candidate threshold
                ypred_val = (yproba_val >= threshold).astype(int)
                score = metric(ytrue_val, ypred_val, average="macro")
                cv_scores.append(score)

            threshold_scores.append((threshold, np.mean(cv_scores)))

        # find the threshold with the highest mean score
        best_threshold, best_score = max(threshold_scores, key=lambda x: x[1])
        return best_threshold, best_score

config = Config()
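The cross-validated threshold search in find_optimal_threshold_cv can be sketched standalone; the labels and "probabilities" below are synthetic, generated only to show the mechanics:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
ytrue = rng.integers(0, 2, size=(200, 6))                          # synthetic multi-label ground truth
yproba = np.clip(ytrue * 0.6 + rng.random((200, 6)) * 0.4, 0, 1)   # noisy scores correlated with ytrue

kf = KFold(n_splits=5, shuffle=True, random_state=42)
threshold_scores = []
for threshold in np.arange(0.05, 0.35, 0.05):
    cv_scores = [
        f1_score(ytrue[idx], (yproba[idx] >= threshold).astype(int),
                 average="macro", zero_division=0)
        for _, idx in kf.split(ytrue)
    ]
    threshold_scores.append((threshold, np.mean(cv_scores)))

best_threshold, best_score = max(threshold_scores, key=lambda x: x[1])
```

Each candidate threshold is scored on every fold, and the one with the best mean macro F1 wins.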

3 Data

The dataset is accessible using tf.keras.utils.get_file to download the file from its URL. N.B. For reproducibility purposes, I also downloaded the dataset locally, since there were times when the link was not available.

Code
# df = pd.read_csv(config.path)
file = tf.keras.utils.get_file("Filter_Toxic_Comments_dataset.csv", config.url)
df = pd.read_csv(file)
Code
library(reticulate)

py$df %>%
  tibble() %>% 
  head(5)
Table 1: First 5 elements
# A tibble: 5 × 8
  comment_text            toxic severe_toxic obscene threat insult identity_hate
  <chr>                   <dbl>        <dbl>   <dbl>  <dbl>  <dbl>         <dbl>
1 "Explanation\nWhy the …     0            0       0      0      0             0
2 "D'aww! He matches thi…     0            0       0      0      0             0
3 "Hey man, I'm really n…     0            0       0      0      0             0
4 "\"\nMore\nI can't mak…     0            0       0      0      0             0
5 "You, sir, are my hero…     0            0       0      0      0             0
# ℹ 1 more variable: sum_injurious <dbl>

Let's create a clean variable for EDA purposes: I want to visually see how many observations are clean versus the other labels.

Code
df.loc[df.sum_injurious == 0, "clean"] = 1
df.loc[df.sum_injurious != 0, "clean"] = 0

3.1 EDA

First a check on the dataset to find possible missing values and imbalances.

3.1.1 Frequency

Code
library(reticulate)
df_r <- py$df
new_labels_r <- py$config$new_labels

df_r_grouped <- df_r %>% 
  select(all_of(new_labels_r)) %>%
  pivot_longer(
    cols = all_of(new_labels_r),
    names_to = "label",
    values_to = "value"
  ) %>% 
  group_by(label) %>%
  summarise(count = sum(value)) %>% 
  mutate(freq = round(count / sum(count), 4))

df_r_grouped
Table 2: Absolute and relative labels frequency
# A tibble: 7 × 3
  label          count   freq
  <chr>          <dbl>  <dbl>
1 clean         143346 0.803 
2 identity_hate   1405 0.0079
3 insult          7877 0.0441
4 obscene         8449 0.0473
5 severe_toxic    1595 0.0089
6 threat           478 0.0027
7 toxic          15294 0.0857

3.1.2 Barchart

Code
library(reticulate)
barchart <- df_r_grouped %>%
  ggplot(aes(x = reorder(label, count), y = count, fill = label)) +
  geom_col() +
  labs(
    x = "Labels",
    y = "Count"
  ) +
  # sort bars in descending order
  scale_x_discrete(limits = df_r_grouped$label[order(df_r_grouped$count, decreasing = TRUE)]) +
  scale_fill_brewer(type = "seq", palette = "RdYlBu") +
  theme_minimal()
ggplotly(barchart)
Figure 1: Imbalance in the dataset with clean variable

It is clearly visible how imbalanced the dataset is. This means it could be useful to compute class weights and use them as an argument during training.

It is clear that most of our texts are clean: 0.8033 of the observations are clean, while only 0.1967 are toxic comments.
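One way to turn the counts in Table 2 into per-label weights is a simple inverse-frequency heuristic (a sketch, not the only option; the counts are those reported above):

```python
# label counts from Table 2 and the dataset size
counts = {"toxic": 15294, "severe_toxic": 1595, "obscene": 8449,
          "threat": 478, "insult": 7877, "identity_hate": 1405}
total_samples = 159571

# positive weight per label: negatives / positives, so rarer labels weigh more
pos_weight = {label: (total_samples - n) / n for label, n in counts.items()}
```

With this scheme, threat (the rarest label) gets the largest weight, pushing the model to pay attention to it despite its scarcity.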

3.2 Sequence length definition

To convert the text into a useful input for a NN, it is necessary to use a TextVectorization layer. See Section 4.

One of its parameters is output_sequence_length: to choose a good value for it, it is useful to analyze the length of our texts. To simulate what the model will do, we are going to remove the punctuation and the newlines from the comments.
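A toy illustration of what output_sequence_length does (the helper below is a stand-in for illustration, not the Keras API): sequences longer than the target length are truncated, shorter ones are padded with 0.

```python
def pad_or_truncate(tokens, length, pad=0):
    # mimic a fixed output_sequence_length: pad with 0s, then cut to size
    return (tokens + [pad] * length)[:length]

print(pad_or_truncate([5, 9, 2], 5))           # [5, 9, 2, 0, 0]
print(pad_or_truncate([5, 9, 2, 7, 1, 4], 5))  # [5, 9, 2, 7, 1]
```

A length that is too small loses text; one that is too large wastes computation on padding, which is why the distribution below matters.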

3.2.1 Summary

Code
library(reticulate)
df_r %>% 
  mutate(
    comment_text_clean = comment_text %>%
      tolower() %>% 
      str_remove_all("[[:punct:]]") %>% 
      str_replace_all("\n", " "),
    text_length = comment_text_clean %>% str_count()
    ) %>% 
  pull(text_length) %>% 
  summary() %>% 
  as.list() %>% 
  as_tibble()
Table 3: Summary of text length
# A tibble: 1 × 6
   Min. `1st Qu.` Median  Mean `3rd Qu.`  Max.
  <dbl>     <dbl>  <dbl> <dbl>     <dbl> <dbl>
1     4        91    196  378.       419  5000

3.2.2 Boxplot

Code
library(reticulate)
boxplot <- df_r %>% 
  mutate(
    comment_text_clean = comment_text %>%
      tolower() %>% 
      str_remove_all("[[:punct:]]") %>% 
      str_replace_all("\n", " "),
    text_length = comment_text_clean %>% str_count()
    ) %>% 
  # pull(text_length) %>% 
  ggplot(aes(y = text_length)) +
  geom_boxplot() +
  coord_flip() +
  theme_minimal()
ggplotly(boxplot)
Figure 2: Text length boxplot

3.2.3 Histogram

Code
library(reticulate)
df_ <- df_r %>% 
  mutate(
    comment_text_clean = comment_text %>%
      tolower() %>% 
      str_remove_all("[[:punct:]]") %>% 
      str_replace_all("\n", " "),
    text_length = comment_text_clean %>% str_count()
  )

Q1 <- quantile(df_$text_length, 0.25)
Q3 <- quantile(df_$text_length, 0.75)
IQR <- Q3 - Q1
upper_fence <- as.integer(Q3 + 1.5 * IQR)

histogram <- df_ %>% 
  ggplot(aes(x = text_length)) +
  geom_histogram(bins = 50) +
  geom_vline(aes(xintercept = upper_fence), color = "red", linetype = "dashed", linewidth = 1) +
  theme_minimal() +
  xlab("Text Length") +
  ylab("Frequency") +
  xlim(0, max(df_$text_length, upper_fence))
ggplotly(histogram)
Figure 3: Text length histogram with boxplot upper fence

Considering all the above analysis, I think a good starting value for output_sequence_length is 911, the upper fence of the boxplot (the dashed red vertical line in the last plot). Doing so, we remove the outliers, which are a small part of our dataset.
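The fence value can be cross-checked in Python from the quartiles reported in Table 3:

```python
q1, q3 = 91, 419                   # 1st and 3rd quartiles from Table 3
iqr = q3 - q1
upper_fence = int(q3 + 1.5 * iqr)  # Tukey's upper fence
print(upper_fence)  # 911
```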

3.3 Dataset

Now we can split the dataset in 3: train, test and validation sets. Considering there is no single function in sklearn that splits into these 3 sets, we can do the following:

  • split into a train set and a temporary set with a 0.3 split;
  • split the temporary set into two equally sized test and validation sets.

Code
x = df[config.features].values
y = df[config.labels].values

xtrain, xtemp, ytrain, ytemp = train_test_split(
  x,
  y,
  test_size=config.temp_split, # .3
  random_state=config.random_state
  )
xtest, xval, ytest, yval = train_test_split(
  xtemp,
  ytemp,
  test_size=config.test_split, # .5
  random_state=config.random_state
  )

The shapes of xtrain, ytrain, xtest, ytest, xval and yval can be printed to confirm the split sizes.

The datasets are created using the tf.data.Dataset API, which builds a data input pipeline. The tf.data API makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations. A tf.data.Dataset is an abstraction that represents a sequence of elements, in which each element consists of one or more components. Here each dataset is created using from_tensor_slices, which builds a tf.data.Dataset from a (features, labels) tuple. .batch lets us work in batches to improve performance, while .prefetch overlaps the preprocessing and model execution of a training step: while the model is executing training step s, the input pipeline is reading the data for step s+1. Check the documentation for further information.

Code
train_ds = (
    tf.data.Dataset
    .from_tensor_slices((xtrain, ytrain))
    .shuffle(xtrain.shape[0])
    .batch(config.batch_size)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

test_ds = (
    tf.data.Dataset
    .from_tensor_slices((xtest, ytest))
    .batch(config.batch_size)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

val_ds = (
    tf.data.Dataset
    .from_tensor_slices((xval, yval))
    .batch(config.batch_size)
    .prefetch(tf.data.experimental.AUTOTUNE)
)
Code
print(
  f"train_ds cardinality: {train_ds.cardinality()}\n",
  f"val_ds cardinality: {val_ds.cardinality()}\n",
  f"test_ds cardinality: {test_ds.cardinality()}\n"
  )
train_ds cardinality: 3491
 val_ds cardinality: 748
 test_ds cardinality: 748
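These cardinalities follow directly from the sample counts stored in Config and the batch size:

```python
import math

batch_size = 32
train_samples, val_samples = 111699, 23936  # values stored in Config

print(math.ceil(train_samples / batch_size))  # 3491 batches in train_ds
print(math.ceil(val_samples / batch_size))    # 748 batches in val_ds (and test_ds)
```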

Check the first element of the dataset to be sure that the preprocessing is done correctly.

Code
train_ds.as_numpy_iterator().next()
(array([b'"Along with the evident commitment of Wikipedians, I also found Wikipedia\'s numerous policies, guidelines and normative behaviors intimidating to me as a newcomer. Kraut and Resnick noted that techniques for encouraging voluntary compliance, such as explicitly stating and prominently displaying guidelines, tend to be more effective with ""insiders who care about the community\xe2\x80\x99s health and their own standings within the community."" They identified four factors which increase voluntary compliance: \xe2\x80\x9ccommitment to the community, legitimacy of norms, the ability to save face, and expectations about rewards for compliance or sanctions for noncompliance.""  With an underdeveloped sense of commitment to the community, a lack of comprehension regarding the strong community norms, and an expectation of sanctions for noncompliance, I was less likely to be bold, or participate, in the community unless directed to do so. The most paralyzing of these factors was the ""high probability that norm violations [would] be detected"" ensured by Wikipedia\'s ""anyone can edit"" community design. \n\n"',
       b'South America \nCamino Real in Peru may be a good entry.  These are the roads connecting the Inca Empire throughout the Andes.  It is historical, but El Camino Real is notable, and a street, so I dont know if it should be included',
       b'"\nCongrats!  ( Ding my phone My support calls E-Support Options ) "',
       b'Perhaps both parties of this issue would be better served by taking a temporary break, and agreeing to come back to Wikipedia in, say, 48 hours?  Mediation might not be necessary at all.',
       b"Dr Ernie Smith writes:\n\n\xc3\x86\xc2\xb5\xc2\xa7\xc5\x93\xc5\xa1\xc2\xb9  I agree with your second option, i.e., \xe2\x80\x9cto follow that paragraph with further academic discussion or even report on a scholarly response to Baugh's work\xe2\x80\x9d.  You should know that after checking the references again I discerned that you are abolutely correct.  Indeed, it is not Blackshire Belay that Baugh posits as his source for Afrocentic view number (ii).  Based on the footnote number 8 that appears after the first phrase in numder (ii), i.e., \xe2\x80\x9crefer to the languages of the African diaspora as a whole;[8]\xe2\x80\x9d Baugh actually attributes this view to; Williams (1975) and Williams (1997).   This prompts the question; is the second phrase that appears to be a continuation of number (ii) which states; \xe2\x80\x9cor it may refer to what is normally regarded as a variety of English: either\xe2\x80\x9d also included as what Dr.Williams posits?  For if it is attributed to Dr. Williams, I still say it is not Afrocentric and to posit this phrase as an Afrocentric view is in fact an outright misrepresentation of what Dr. Williams has stated.\n\nErnie Smith ~",
       b'"\n\nBritishWatcher: Thanks for your courteous and transparent question (the first I have seen from the UK clique since I got ""involved"").  I\'ll address your second point.  If my answer to that is not clear, there is little point taking the time to suggest actual changes.  Your clarifying question mentions the premise that each part of the UK has its own subdivisions.  The article states this, but your phrasing implies that the parts are not themselves subdivisions.  The article says nothing about that.  Besides the lack of clarity on that point (assuming your reading is correct), there is another confusion in the article: I think the majority of the people reading this article will understand that the four countries are the focus of the article and that the title of the article (Subdivisions...) refers to them.  If it is the case that the parts are not themselves subdivisions, then the article is far from clear on this point (except to the three or four who have been ""involved"").  Hence, there are two confusions in the article: 1) the title suggests that the parts are subdivisions (because of the structure and general content of the article) and 2) since the article explicitly divides the UK into the four countries, this suggests that the parts are themselves subdivisions.  Your reading suggests that the title of the article should be ""subdivisions of the four parts of UK"", not ""subdivisions of the UK"".  Until these confusions are addresses, I think the article casts more shadow than light on the subject.   "',
       b'The term death metal come from the band death.',
       b'"\n\n Admin Help Needed \n\n Question for administrator \n\nRavinder121 is deleting text from my talk page. Please ban him from deleting text from my talk page.\n\nAlso, the user Ravinder121 is engaging in disruptive editing editing in the following wiki page ""Chamar"". Please ban him from making any changes to this page.\n  "',
       b'"\n\n They use the words ""potential effects"", ""could"", etc. the tone is very different and expresses an appropriate level of scientific uncertainty with a very detailed examination of climate, it\'s complexity, links, etc. the whole page needs to be revamped!\n\n Certainly the Wikipedia page is free. But if you\'re interested in maintaining a good page, especially on Global Warming, it should be longer than a page on the entire collection of pages related to Mars and all other planets combined -due to it\'s importance, and not seem such a matter of opinion and vengence against the right, or whatever it is the tone of the admins are directed towards.  "',
       b'Balrog \n\nHave you seen the Street Fighter movies? Balrog looked like a 210cm tall giant. I think that 198cm is an underestimated height for Balrog.',
       b"Yeah, that's a great explanation. , anyway, thank you for your reply.",
       b'"\n\n Definition \n\nI\'m not happy with the definition as it stands, and I would like to see a reference for the assertion that Horslips were the first to name the genre. ""Celtic"" in modern usage means more than the Ancient Celts, and much music that involves the lives of Modern Celts is considered Celtic. Thus, there\'s no reason a band like The Pogues - who sung frequently about Irish matters - should be considered less Celtic simply because they were focused on the present. Similarly, I see no reason Horslips\' ""driving hard rock"" is more suited towards being real ""Celtic Rock"" than the blues influenced music of L&L; era Fairport.  "',
       b'My day ==\n\nMany thanks for that! I will aim to enjoy it. Happy Christmas & New Year to you & yours. My Wiki-Christmas card is here.   \n\n==',
       b"Glad to see you guys are interested in an expanded discussion. I think CfD is appropriate; I had a very productive discussion about Referendum categories, that had nothing to do with deletion. But, Talk:Natural history or Talk:Geology might be appropriate as well. Let me know where you guys go, and I'll try to drop in\xe2\x80\xa6 -",
       b'"\n\n I followed the links to an article on ""consensus decision making"". The entire article seemed to me to be on the simplistic side (perhaps due to my ongoing research in Organizational Communication), but it had a section on ""Quaker decision making."" I don\'t know if it would be appropriate to link back here (sort of a 360 degree thing), but it certainly is relevant. \n\n Perhaps an article on ""Quaker meetings"" might be worth considering? I\'m thinking in terms of examining several types of Quaker meetings (clearness, weddings, threshing, as well as business) in light of Quaker philosophies or traditional practices. Put another way, I am growing increasingly sensitive to cataloging ""weird Quaker ways of doing things"" in a de-contextualized sense. Perhaps we would be better off discussing how the Quakers\' beliefs led to these different ways of meeting? \n\n Some suggestiongs: \nUniversal ministry (everyone a minister), \nimmediate access (not mediated by Priest or ecclesiastical structure), \nprivileging peacemaking (refusing to accept any sort of social darwinism in which the strongor aggressively vocalprevail), and \ntestimony of integrity (challenging participants to put beliefs into actions, even if it leads to unconventional methods)\n\n It doesn\'t have to be an apologetic for Quaker practices, but it would help explain why these practices are held so closely by Quakers, even of dramatically different theologies.\n\n And thank you all, for putting up with my ""bull in a china shop"" entrance. My entries could have been read as patronizing and as assuming that I was the only ""real"" Quaker present. I did not intend to communicate that; I am sorry. I suppose that several Quaker sensibilities were at work even here, among the tools and spare parts! Thank you for your gracious patience, all. \n\n Roy  (but I still don\'t get all of the clever codes and secret handshakes. . 
.apparently there are different norms for the page discussion site than for the program/topic discussion site, and for a third sort of discussion place (which I haven\'t really assimilated in a coherent manner, yet. . .)"',
       b'"\n\n Begin text copy from logical subpage to fix broken link and to restore continuity of dialog \n\n<>\n\nCreationism is the explanation  that the universe and all life were created by the deliberate act of God.\n\nIn looking through the historical record at the competition between creation and evolution in Darwin\'s day, I was impressed by Thomas Huxley\'s 1887 account of how Origin of Species provided the first explanation that in Huxley\'s view was a better explanation than creation.  Huxley describes the sense in which he rejected creation as an explanation.\n\nIf Agassiz told me that the forms of life which had successively tenanted the globe were the incarnations of successive thoughts of the Deity; and that he had wiped out one set of these embodiments by an appalling geological catastrophe as soon as His ideas took a more advanced shape, I found myself not only unable to admit the accuracy of the deductions from the facts of paleontology, upon which this astounding hypothesis was founded, but I had to confess my want of any means of testing the correctness of his explanation of them. And besides that, I could by no means see what the explanation explained. \n\nHuxley describes his similar rejection of the explanations of the evolutionists prior to Darwin.\n\nAnd, by way of being perfectly fair, I had exactly the same answer to give to the evolutionists of 1851-8. . . . [A] thorough-going evolutionist, was Mr. Herbert Spencer, whose acquaintance I made, I think, in 1852. . . . Many and prolonged were the battles we fought on this topic. But even my friend\'s rare dialectic skill and copiousness of illustration could not drive me from my agnostic position. I took my stand upon two grounds: firstly, that up to that time, the evidence in favor of transmutation was wholly insufficient; and, secondly, that no suggestion respecting the causes of the transmutation assumed, which had been made, was in any way adequate to explain the phenomena. 
Looking back at the state of knowledge at that time, I really do not see that any other conclusion was justifiable. \n\nFurthermore, any self-respecting religion-neutral anthropologist, such as Robert L. Carneiro, Curator of the American Museum of Natural History, would classify creation and evolution as mere successive stages of incomplete but improving explanations in a universe where there is no God to assist the women and men who attempt to discover the truth of their origins. \n\nFrom all of the above, I suggest that it is more accurate to define creationism as an explanation rather than a belief.  After all, the survival of the belief derives from the usefulness of the belief, and a primary use of creationism is explaining how we all got here.  According to Thomas Huxley, until Origin of Species, creationism was as good an explanation as evolutionism.  And for the majority of American voters who cannot understand the evolutionists\' explanations, creationism is a better explanation than evolutionism even yet today.  - 16:08, 25 Aug 2004 (UTC)\n\n Considering that ""Belief in the psychological sense is a representational mental state that takes the form of a propositional attitude and in the religious sense, belief refers to a part of a wider spiritual or moral foundation, generally called faith, and that creationism is part and parcel of the christian faith, I think the use of the term ""belief"" was completely justified. Creationism is indeed an explanation, but it is an explanation founded on belief, hence it is a belief. It is not founded on knowledge or evidence; to imply otherwise, which is what you\'re doing, is to create a false impression that Creationism shares some sort of parity with other explanations which do not require belief in the supernatural. It does not. You seem to be substituting your own personal bias for this imputed ""evolutionist bias"" you claim is on the Creationism page.  16:48, 25 Aug 2004 (UTC)\n\n-\n\n""Personal bias""?  Nope.  
I have b',
       b'"\nI glad you know why I\'m on probation, however I\'m still none the wiser! That diff you cite is as relevant as this one, you make a claim you back it up! More so [for Admin\'s and they are expected to lead by example. Now I\'m being very reasonable and patient and have been making every attempt to resolve this, as have a number of editors, this should really not be portrayed as editors making a point. Now I\'ll allow Elonka the opportunity to respond or better still provide the diff\'s that you are adept enough to be able to pick out. \'fenian\' "',
       b'"\n\n I\'m not seeing what points readers to Hotchkiss in popular culture. Please clarify.\n\nIf the section is called ""Hotchkiss in Print"" then maybe movie references are not appropriate here, or the section should be renamed.\n\nAs to your point about ""disgruntled alums:"" Whether the quotations are positive or negative is immaterial. If they are references in print and they shed light on the history and reputation of the school, then the quotations seem entirely appropriate here. Would we want to include every last mention of Hotchkiss in the written record? Perhaps not. However, below are reasons for including quotes I have recently added (actually, re-added because they were removed at some point).\n\nThomas Hoving quotation: Hoving is one of the more notable alums, a longtime director of the Metropolitan Museum of Art. As a public figure who is also an alum, his perspective on the school is significant. McPhee is also a leading writer of profiles and this piece originally appeared in the New Yorker. So, it\'s authoritative.\n\nThe Cullman quotation: I believe Cullman was chairman of the board of trustees, and the Cullman family has been one of the school\'s major supporters. Since Cullman is an alumnus and a prominent supporter, his comments on the school are worth noting.\n\nJulian Houston is a judge in Massachusetts. As a successful legal professional, an alumnus, and a person of color, who has written a book that is clearly based on his experience at the school, it is appropriate to quote him on his time at the school.\n\nArchibald MacLeish is among the school\'s most prominent alumni. Because of his stature, his comments on the school are notable.\n\nLemisch is both an alumnus and an emeritus history professor, who has written an article on the history of Hotchkiss. 
Because he is an alum and a historian, and because the subject of this article is the history of the school, a quotation from it is notable.\n\nThe above reasons support the relevance of these quotations to this article.\n\nIf there are other quotes by prominent alums who enjoyed their time at Hotchkiss, of course those would be worth citing here as well. "',
       b'It comes from a desire to become a Wikipedia admin, as they are a bunch of Jew fags.',
       b"Shucks, he's right. This article reads like szhophrenics need to 'get over it' or have a good sit down session with their shrink. it's slanted!",
       b"Assuming you are asking if there is a way to keep the guest stars in the infobox (at least that's how I'm reading it), your best bet is to take that to Template_talk:Infobox_Television_episode. If there's no objections, ping me in a week or so and I might be able to code it up for you.",
       b'"\n\n Requested move \n\nZabranjeno Pu\xc5\xa1enje \xe2\x86\x92 Zabre\xc7\x8ceno Pu\xc5\xa1e\xc7\x8ce \xe2\x80\x93 the name of the band is misspelled with digraphs ""\xc7\x8c"" (""U+01CC LATIN SMALL LETTER NJ"") spelled out as combinations of ""n"" and ""j"" letters, which goes against the rules of  (or Bosnian if one prefers) language. As long as there is no English name of the band is known (except for the Belgrade branch of the band after the split), the native language convention should be followed per . \xe2\x80\x94   "',
       b"Gee, didn't know there was such a thing. Thanks,",
       b'"\n\n Release date \n\nAn anon. IP has edited the page to alter the release date:\n\nDespite common misconception, the game was NOT released in all territories simultaneously, and it came out in November, not August. I have no idea why everybody thinks it was August. I\'m looking at a copy of Dec 97 N64 Mag right now - it was Nov.\n\nGameSpot\'s review is dated August 1997, and when Edge recently put their original review online, they said it was originally printed in the July 1997 issue. (Edge is a UK publication, but its reviews usually reflect a game\'s earliest worldwide release - or at least its earliest English language release.)\n\nTo the anon editor: a magazine with a different release date might prove your assertion that the game was not a simultaneous worldwide release. So what country is your copy of that magazine from?\n\nDoes anyone else have copies of magazines from the game\'s launch that could be cited as references for its release date? Talk "',
       b'I think you need to realise its not as important as you health and peace of mind, think about it. 175.110.222.144',
       b'"\n Wikimedia Commons \n\nThank you for uploading images/media  to Wikipedia! There is, however, another Wikimedia Foundation project called Wikimedia Commons, a central media repository for all free media. In future, please upload media there instead (see m:Help:Unified login). That way, all of the other language Wikipedias can use them too, as well as our many sister projects. This will also allow our visitors to search for, view and use our media in one central location. If you wish to move previous uploads to Commons, see Wikipedia:Moving images to the Commons (you may view images you have previously uploaded by going to your user contributions on the left and choosing the \'image\' namespace from the drop down box). Please note that non-free content, such as images claimed as fair use, cannot be uploaded to the Wikimedia Commons. Help us spread the word about Commons by informing other users, and please continue uploading!samaK \n\n Waterfalls \nHamilton looks like an impressive place for waterfalls.  I""ll certainly have to include it in my list of places to visit.  I can\'t speak for any of the areas in the Pacific Northwest (which is home to hundreds of high, impressive falls), in North Carolina, according to Kevin Adams\' ""North Carolina Waterfalls: A Hiking and Photography Guide"", our state is estimated to have around 1,000 - 1,500 falls in the state, most of them in Transylvania County, NC (aka ""The Land of Waterfalls"").  Estimates are that TC contains between 400-500 ""major"" waterfalls in an area slightly smaller than the City of Hamilton (390 sq mi versus Hamilton\'s 439 sq mi).  As for what you call ""major"" - that\'s pretty much up to the viewer.  One man\'s gushing torrent is another man\'s damp spot on the side of a rock.  
In TC, a major waterfall is usually considered at least 15-20\' high - not including talus - or having enough current to be an obvious waterfall (such as Hooker Falls).\n\nIf you ever get the chance to visit NC, please do.  We\'ve got a lot of beautiful, rugged country.  To this day, I get reports about new waterfalls discovered by people who are just now able to scamper through some of the roughest terrain the east coast has to offer.    \n\nPart of the problem with defining the ""City with the most waterfalls"" is that neither term is overly well defined.  ""City"" can mean ""town limits"" or ""Metropolitan Area"" or any other of a dozen different definitions.  Then you have the problem with the exact definition of a waterfall.  Ergo, you have the problem.   \n\nTrail in New Zealand\nI don\'t know of another town with that many waterfalls.  However I do know that when it rains really heavily the valley that you walk up first on the Milford Track, has over 300 waterfalls. -   \n\nWaterfalls\nThere are loads of waterfalls in Washington State in the Olympic and Cascade mountains.  I am only really documenting major waterfalls, not all the little ones.  Konrad    \n\nFudge\n\nWhere on the blogspot site does it say that, you need to link to the specific page. And a fansite doesn\'t really count as a reliable source. Twa2 \n\nWaterfalls\nRegarding the 100 waterfalls Hamilton, I am unaware of another city area with that many.  There are probably ""areas"" with about that much, such as the Columbia River Gorge, but the exact number there I am ignorant of and they are not that many major ones.  Here on the Olympic Peninsula we have a number of falls, but it is a really large area and probably does not reach the density that you have indicated there in Hamilton.  
However it must be noted that I am a amateur at this and am ignorant of much   \n\nMedia in Hamilton, Ontario\nBecause media lists of this type were formerly titled in a variety of naming formats, there was a discus',
       b'"::::::OMG you seriously nominated this article for deletion because you think Jade Goody has never ""done anything worthy""??? She\'s a British celebrity, she HAS written a book - her autobiography, she brought out her own perfume...her having terminal cancer is by no means the reason as to why she has a wiki article and if you seriously think that nothing she has done has ever been notable then you really need to buck up on your knowledge of her.   \n\n"',
       b'"\n\n revived / remains \n\nSo:\n What Toptchyan wrote is that after Malkhasyants (1940, and Sarkissian\'s article is from March 1940, so just before it) and untill Toumanoff (1961), thus ""second half of the 20th century"", there has been no serious scholarship opposing the 5th c. dating. You (Grandmaster) mentioned earlier Joseph P. Smith (1952) but (i) his book has an extremely thin link with the debate on the dating, (ii) he\'s not specialised in this area, and most of all (iii) he didn\'t take side in his very few lines on Moses. Topchyan not being contradicted on this, it is clear that there\'s a break of 21 years between Malkhasyants\' book and the next serious scholar contesting the 5th c. dating (according to Topchyan, ""la publication de son livre r\xc3\xa9gla la question pour environ deux d\xc3\xa9cennies"" or in English ""the publishing of his book settled the issue for approximately two decades""). Therefore, during these two decades, the issue didn\'t ""remain"" disputed (i.e. no serious scholarship published), and Toumanoff\' s publishing ""revived"" the critical points in 1961.\n Where you perhaps have a point (with ""remains"") is that Toumanoff\'s main point, the ""Bagratids argument"", had already been raised before by Adontz and Manandian. It had been addressed by Malkhasyants though (i.e. Moses wrote that Bagratids were tagadir ev aspet, two titles from the Arsacid period which were forgotten in latter centuries).\n I think the article should represent both the historiographical break and the historiographical link, and I think the current sentence do so (""revived... continue to maintain""). I tried to reword it, but in fact, the more I think about the issue and the more I read the sentence, the less I see the problem with it.\n  \nOh, btw, no need to attack Toptchyan: Mah\xc3\xa9 for instance gives absolutely no reference between 1940 and 1961.   \n\nDid you actually read Sarkissian\'s article? 
He quotes authors from 1920s and 1930s, and says: \n\nSuch was, and to a large extent still is, the traditional view about the life and the work of Moses of Khoren which was accepted by the Armenians. It is this traditional and unquestioned view that has been subjected to much severe criticism during the course of the past hundred years. The object of this paper is to summarize and evaluate such criticism.\n\nSo according to Sarkissian, during the 100 years before 1940 the 5th century dating was constantly in dispute. Thus, Topchyan contradicts other sources. Criticism was not revived, it always existed. If you insist on revived claim, it must be presented only as Topchyan\'s personal opinion, and not as a fact, because it is not true. master "',
       b'"\nIf you phrase it like this, I don\'t see how they could possibly disagree:\n\n"" Dr.Winterberg\'s calculation made general relativistic corrections to atomic clocks in orbit.  Today, these relativistic corrections are used to provide the  precise accuracy required for the GPS satellite system."""',
       b'"\n\n Please stop. If you continue to blank out (or delete portions of) page content, templates or other materials from Wikipedia, you will be blocked from editing.  drama "',
       b'":  I guess that, in your second category (""indirect parliamentarism""), the most common arrangement is actually for the Head of State (monarch or president) to appoint a prime minister and, then, appoint all other government of ministers on proposal of the latter. That is what happens both in the UK (where the power to appoint the government rests with the Queen)  and in the France (where it is the President of the Republic who appoints the Council of Ministers). In Germany, the Federal President also appoints all other federal ministers on the recommendation of the chancellor, but the chancellor him/herself is actually elected by the Bundestag, thus making Germany fall into your proposed ""direct parliamentarism"" category instead.\n\n"',
       b'"\n I think I\'m on pretty safe ground stating that ""Absolute load of shit"" is being uncivil. Have a look a the recent comments made by the  when he blocked LevenBoy for using the word ""Bullshit"".   Talk  "'],
      dtype=object), array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0]]))

We also check the shapes. We expect features of shape (batch,) and targets of shape (batch, number of labels).

Code
print(
  f"text train shape: {train_ds.as_numpy_iterator().next()[0].shape}\n",
  f" text train type: {train_ds.as_numpy_iterator().next()[0].dtype}\n",
  f"label train shape: {train_ds.as_numpy_iterator().next()[1].shape}\n",
  f"label train type: {train_ds.as_numpy_iterator().next()[1].dtype}\n"
  )
text train shape: (32,)
  text train type: object
 label train shape: (32, 6)
 label train type: int64

4 Preprocessing

Of course preprocessing! Raw text is not the kind of input a NN can handle. The TextVectorization layer is meant to handle natural language inputs. The processing of each example contains the following steps:

1. Standardize each example (usually lowercasing + punctuation stripping).
2. Split each example into substrings (usually words).
3. Recombine substrings into tokens (usually ngrams).
4. Index tokens (associate a unique int value with each token).
5. Transform each example using this index, either into a vector of ints or a dense float vector.

For more details, see the Keras TextVectorization documentation.
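Before using the real layer, the five steps can be sketched in pure Python. This is an illustrative toy, not the Keras implementation: the actual TextVectorization layer also supports ngrams and dense output modes, but like it, the sketch reserves index 0 for padding and index 1 for out-of-vocabulary tokens.

```python
import re
from collections import Counter

def toy_vectorize(corpus, max_tokens=10, output_sequence_length=6):
    """Toy, pure-Python walk-through of the five steps above."""
    # 1. standardize: lowercase + strip punctuation
    standardized = [re.sub(r"[^\w\s]", "", text.lower()) for text in corpus]
    # 2./3. split on whitespace (tokens here are single words)
    tokenized = [text.split() for text in standardized]
    # 4. index tokens: 0 is padding, 1 is out-of-vocabulary
    counts = Counter(tok for toks in tokenized for tok in toks)
    vocab = ["", "[UNK]"] + [w for w, _ in counts.most_common(max_tokens - 2)]
    index = {w: i for i, w in enumerate(vocab)}
    # 5. transform each example into a fixed-length vector of ints
    sequences = []
    for toks in tokenized:
        seq = [index.get(t, 1) for t in toks][:output_sequence_length]
        seq += [0] * (output_sequence_length - len(seq))  # pad with 0
        sequences.append(seq)
    return vocab, sequences

vocab, seqs = toy_vectorize(["The cat sat.", "The dog sat, too!"])
```

Each comment becomes a fixed-length integer sequence, which is exactly the shape of input the embedding layer below expects.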

Code
text_vectorization = TextVectorization(
  max_tokens=config.max_tokens,
  standardize="lower_and_strip_punctuation",
  split="whitespace",
  output_mode="int",
  output_sequence_length=config.output_sequence_length,
  pad_to_max_tokens=True
  )

# prepare a dataset that only yields raw text inputs (no labels)
text_train_ds = train_ds.map(lambda x, y: x)
# adapt the text vectorization layer to the text data to index the dataset vocabulary
text_vectorization.adapt(text_train_ds)

This layer is configured as follows:

  • max_tokens: 20000, a common vocabulary size for text classification. It is the maximum size of the vocabulary for this layer.
  • output_sequence_length: 911. See Figure 3 for the reason why. Only valid in "int" mode.
  • output_mode: "int", one integer index per split string token. When output_mode == "int", 0 is reserved for masked locations; this reduces the vocab size to max_tokens - 2 instead of max_tokens - 1.
  • standardize: "lower_and_strip_punctuation".
  • split: on whitespace.

To preserve the original comments as text and also have a tf.data.Dataset in which the text is preprocessed by the TextVectorization function, it is possible to map it to the features of each dataset.

Code
processed_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=tf.data.experimental.AUTOTUNE
)
processed_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=tf.data.experimental.AUTOTUNE
)
processed_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=tf.data.experimental.AUTOTUNE
)

5 Model

5.1 Definition

Define the model using the Functional API.

Code
def get_deeper_lstm_model():
    clear_session()
    inputs = Input(shape=(None,), dtype=tf.int64, name="inputs")
    embedding = Embedding(
        input_dim=config.max_tokens,
        output_dim=config.embedding_dim,
        mask_zero=True,
        name="embedding"
    )(inputs)
    x = Bidirectional(LSTM(256, return_sequences=True, name="bilstm_1"))(embedding)
    x = Bidirectional(LSTM(128, return_sequences=True, name="bilstm_2"))(x)
    # Global average pooling
    x = GlobalAveragePooling1D()(x)
    # Add regularization
    x = Dropout(0.3)(x)
    x = Dense(64, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)
    x = tf.keras.layers.LayerNormalization()(x)  # full path: LayerNormalization is not in the imports above
    outputs = Dense(len(config.labels), activation='sigmoid', name="outputs")(x)
    model = Model(inputs, outputs)
    model.compile(optimizer='adam', loss="binary_crossentropy", metrics=config.metrics, steps_per_execution=32)
    
    return model

lstm_model = get_deeper_lstm_model()
lstm_model.summary()

5.2 Callbacks

Finally, the model has been trained using 2 callbacks:

  • Early Stopping, to avoid consuming the Kaggle GPU quota.
  • Model Checkpoint, to keep the weights of the best epoch found during training.

Code
# callbacks
my_es = config.get_early_stopping()
my_mc = config.get_model_checkpoint(filepath="/checkpoint.keras")
callbacks = [my_es, my_mc]

5.3 Final preparation before fit

Since the dataset is imbalanced, we calculate class weights to improve performance. These will be passed to the model during training.

Code
lab = pd.DataFrame(columns=config.labels, data=ytrain)
r = lab.sum() / len(ytrain)
class_weight = dict(zip(range(len(config.labels)), r))
df_class_weight = pd.DataFrame.from_dict(
  data=class_weight,
  orient='index',
  columns=['class_weight']
  )
df_class_weight.index = config.labels
Code
library(reticulate)
py$df_class_weight
Table 4: Class weight
              class_weight
toxic          0.095900590
severe_toxic   0.009928468
obscene        0.052757858
threat         0.003061800
insult         0.049132042
identity_hate  0.008710911

It is also useful to define the steps per epoch for the train and validation datasets. This prevents the fit from exhausting the dataset mid-training, which happened to me.
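The arithmetic is simple floor division; the sample counts below are illustrative assumptions (the real values live in config), chosen so that validation_steps matches the 748 used in the evaluation step later.

```python
# Hypothetical sample counts: the real values come from config.train_samples,
# config.val_samples and config.batch_size.
train_samples = 143_613   # assumed training set size
val_samples = 23_936      # assumed validation set size
batch_size = 32

# Integer division drops the final partial batch, so with .repeat()
# every epoch consumes exactly this many complete batches.
steps_per_epoch = train_samples // batch_size
validation_steps = val_samples // batch_size
```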

Code
steps_per_epoch = config.train_samples // config.batch_size
validation_steps = config.val_samples // config.batch_size

5.4 Fit

The fit has been done on Kaggle to leverage the GPU. Some considerations about the model:

  • .repeat() ensures the model sees the whole dataset.
  • epochs is set to 100.
  • validation_data uses the same repeat.
  • callbacks are the ones defined before.
  • class_weight makes the training account for the frequency of each class, because our dataset is imbalanced.
  • steps_per_epoch and validation_steps are required because of repeat.
Code
history = lstm_model.fit(
  processed_train_ds.repeat(),
  epochs=config.epochs,
  validation_data=processed_val_ds.repeat(),
  callbacks=callbacks,
  class_weight=class_weight,
  steps_per_epoch=steps_per_epoch,
  validation_steps=validation_steps
  )

Now we can import the model and the history trained on Kaggle.

Code
model = load_model(filepath=config.model)
history = pd.read_excel(config.history)

5.5 Evaluate

Code
validation = model.evaluate(
  processed_val_ds.repeat(),
  steps=validation_steps, # 748
  verbose=0
  )
Code
val_metrics <- tibble(
  metric = c("loss", "precision", "recall", "auc", "f1_score"),
  value = py$validation
  )
val_metrics
Table 5: Model validation metric
# A tibble: 5 × 2
  metric     value
  <chr>      <dbl>
1 loss      0.0542
2 precision 0.789 
3 recall    0.671 
4 auc       0.957 
5 f1_score  0.0293

5.6 Predict

For prediction, the model does not need the repeated dataset, because it has already been trained on all of the training data. Now it only has to consume the new data once to produce predictions.

Code
predictions = model.predict(processed_test_ds, verbose=0)

5.7 Confusion Matrix

The best way to assess the performance of a multi label classifier is a confusion matrix. Sklearn has a specific function, multilabel_confusion_matrix, which handles the fact that one prediction can carry multiple labels by computing one binary confusion matrix per label.
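A toy example (with made-up labels and three classes instead of six) shows the shape of the output: one 2x2 matrix per label, laid out as [[TN, FP], [FN, TP]].

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Toy multilabel data: 4 samples, 3 labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 0]])

# One binary confusion matrix per label, shape (n_labels, 2, 2)
mcm = multilabel_confusion_matrix(y_true, y_pred)
```

For instance, label 2 was predicted positive zero times while being true once, so its matrix is [[3, 0], [1, 0]]: three true negatives and one false negative.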

5.7.1 Grid Search Cross Validation for best threshold

Grid Search CV is a technique for fine-tuning the hyperparameters of a ML model. It systematically searches through a set of hyperparameter values to find the combination that leads to the best model performance. In this case, I pair it with KFold Cross Validation, a resampling technique that splits the data into k consecutive folds: each fold is used once as validation while the k - 1 remaining folds form the training set. See the documentation for more information.
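The config.find_optimal_threshold_cv helper used below is defined elsewhere, so this is only a sketch of the idea under my own assumptions (the real grid, fold count and scoring may differ): sweep a grid of candidate thresholds, score each one as the mean metric over the K validation folds, and keep the best.

```python
import numpy as np
from functools import partial
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

def find_optimal_threshold_cv_sketch(y_true, y_proba, metric, n_splits=5,
                                     thresholds=np.arange(0.05, 0.95, 0.05)):
    """Grid-search a confidence threshold, scoring each candidate as
    the mean metric over k validation folds (illustrative sketch)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    best_threshold, best_score = thresholds[0], -np.inf
    for t in thresholds:
        y_pred = (y_proba >= t).astype(int)
        # evaluate the thresholded predictions on each validation fold
        fold_scores = [metric(y_true[idx], y_pred[idx])
                       for _, idx in kf.split(y_true)]
        score = np.mean(fold_scores)
        if score > best_score:
            best_threshold, best_score = t, score
    return best_threshold, best_score

# usage with a macro-averaged F1 on synthetic, well-separated data
macro_f1 = partial(f1_score, average="macro", zero_division=0)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(200, 6))
y_proba = np.clip(y_true * 0.6 + rng.random((200, 6)) * 0.5, 0.0, 1.0)
t_best, s_best = find_optimal_threshold_cv_sketch(y_true, y_proba, macro_f1)
```

On this synthetic data positives score at least 0.6 and negatives below 0.5, so the search settles on a threshold around 0.5 with a near-perfect macro F1.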

The model is trained to optimize recall. The decision was made because the cost of missing a True Positive is greater than that of a False Positive: missing an injurious comment is worse than classifying a clean one as harmful.

5.7.2 Confidence threshold and Precision-Recall trade off

Whilst the KFold grid search technique is useful to test multiple hyperparameters, it is important to understand the problem we are facing. A multi label deep learning classifier outputs a vector of per-class probabilities, which must be converted into a binary vector using a confidence threshold.

  • The higher the threshold, the fewer classes the model predicts, increasing model confidence [higher Precision] but also the number of missed classes [lower Recall].
  • The lower the threshold, the more classes the model predicts, decreasing model confidence [lower Precision] but also the number of missed classes [higher Recall].

Threshold selection means deciding which metric to prioritize, based on the problem we are facing and the relative cost of misjudging. Toxic comment filtering is similar to cancer diagnosis: it is better to predict cancer in people who do not have it [False Positive] and perform further analysis than to miss the disease in a patient who has it [False Negative].
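The trade-off is easy to see on a single comment. The probabilities below are made up for illustration, in the order of the six labels:

```python
import numpy as np

# Made-up per-label probabilities for one comment, in the order
# (toxic, severe_toxic, obscene, threat, insult, identity_hate).
proba = np.array([0.92, 0.40, 0.75, 0.08, 0.55, 0.12])

strict = (proba >= 0.7).astype(int)   # high threshold: fewer labels fire
lenient = (proba >= 0.3).astype(int)  # low threshold: more labels fire
```

The strict threshold keeps only toxic and obscene, while the lenient one also flags severe_toxic and insult: lowering the threshold trades precision for recall.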

I decided to train the model on the F1 score, to obtain a model balanced in both precision and recall, and to leave it to threshold selection to boost the recall performance.

Moreover, the model has been trained on the macro average F1 score, a single performance indicator obtained as the unweighted mean of the per-class F1 scores.

\[ F1\ macro\ avg = \frac{\sum_{i=1}^{n} F1_i}{n} \]

It is useful with imbalanced classes because it weights each class equally: it is not influenced by the number of samples per class. This is set both in config.metrics and in find_optimal_threshold_cv.
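The formula above can be checked against sklearn on toy data where one label has far more support than the others; the macro average is simply the mean of the per-class F1 scores:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multilabel data: label 0 has much more support than labels 1 and 2.
y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [1, 0, 1],
                   [1, 0, 0]])
y_pred = np.array([[1, 0, 0],
                   [1, 0, 0],
                   [0, 0, 1],
                   [1, 0, 0]])

# Per-class F1 scores, then their unweighted mean: every class counts
# equally regardless of how many samples it has.
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
```

Here the per-class F1 scores are 6/7, 0 and 1, so the macro average is 13/21: the completely missed label 1 drags the score down by a full third even though it has only one positive sample.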

5.7.2.1 f1_score

Code
ytrue = ytest.astype(int)
y_pred_proba = predictions
optimal_threshold_f1, best_score_f1 = config.find_optimal_threshold_cv(ytrue, y_pred_proba, f1_score)

print(f"Optimal threshold: {optimal_threshold_f1}")
Optimal threshold: 0.15000000000000002
Code
print(f"Best score: {best_score_f1}")
Best score: 0.4788653077945807
Code

# Use the optimal threshold to make predictions
final_predictions_f1 = (y_pred_proba >= optimal_threshold_f1).astype(int)

Optimal threshold f1 score: 0.15. Best score: 0.4788653.

5.7.2.2 recall_score

Code
ytrue = ytest.astype(int)
y_pred_proba = predictions
optimal_threshold_recall, best_score_recall = config.find_optimal_threshold_cv(ytrue, y_pred_proba, recall_score)

# Use the optimal threshold to make predictions
final_predictions_recall = (y_pred_proba >= optimal_threshold_recall).astype(int)

Optimal threshold recall: 0.05. Best score: 0.8095814.

5.7.2.3 roc_auc_score

Code
ytrue = ytest.astype(int)
y_pred_proba = predictions
optimal_threshold_roc, best_score_roc = config.find_optimal_threshold_cv(ytrue, y_pred_proba, roc_auc_score)

print(f"Optimal threshold: {optimal_threshold_roc}")
Optimal threshold: 0.05
Code
print(f"Best score: {best_score_roc}")
Best score: 0.8809499649742268
Code

# Use the optimal threshold to make predictions
final_predictions_roc = (y_pred_proba >= optimal_threshold_roc).astype(int)

Optimal threshold roc: 0.05. Best score: 0.88095.

5.7.3 Confusion Matrix Plot

Code
# convert probability predictions to predictions
ypred = predictions >=  optimal_threshold_recall # .05
ypred = ypred.astype(int)

# create a plot with 3 by 2 subplots
fig, axes = plt.subplots(3, 2, figsize=(15, 15))
axes = axes.flatten()
mcm = multilabel_confusion_matrix(ytrue, ypred)
# plot the confusion matrices for each label
for i, (cm, label) in enumerate(zip(mcm, config.labels)):
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot(ax=axes[i], colorbar=False)
    axes[i].set_title(f"Confusion matrix for label: {label}")
plt.tight_layout()
plt.show()
Figure 4: Multi Label Confusion matrix

5.8 Classification Report

Code
cr = classification_report(
  ytrue,
  ypred,
  target_names=config.labels,
  digits=4,
  output_dict=True
  )
df_cr = pd.DataFrame.from_dict(cr).reset_index()
Code
library(reticulate)
df_cr <- py$df_cr %>% dplyr::rename(names = index)
cols <- df_cr %>% colnames()
df_cr %>% 
  pivot_longer(
    cols = -names,
    names_to = "metrics",
    values_to = "values"
  ) %>% 
  pivot_wider(
    names_from = names,
    values_from = values
  )
Table 6: Classification report
# A tibble: 10 × 5
   metrics       precision recall `f1-score` support
   <chr>             <dbl>  <dbl>      <dbl>   <dbl>
 1 toxic            0.552  0.890      0.682     2262
 2 severe_toxic     0.236  0.917      0.375      240
 3 obscene          0.550  0.936      0.692     1263
 4 threat           0.0366 0.493      0.0681      69
 5 insult           0.471  0.915      0.622     1170
 6 identity_hate    0.116  0.720      0.200      207
 7 micro avg        0.416  0.896      0.569     5211
 8 macro avg        0.327  0.812      0.440     5211
 9 weighted avg     0.495  0.896      0.629     5211
10 samples avg      0.0502 0.0848     0.0597    5211

6 Conclusions

The BiLSTM model, optimized for high recall, performs well enough to make predictions for each label. Considering the low support for the threat label, the performance is not bad: see Table 2 and Figure 1; the threat label covers only 0.27% of the observations. The model has been optimized for recall because the cost of not identifying an injurious comment as such is higher than the cost of flagging a clean comment as injurious.

Possible improvements include increasing the number of observations, especially for the threat label. In general there are too many clean comments. This could be mitigated by undersampling the clean comments, which I explicitly avoided in order to check the performance of the BiLSTM on an imbalanced dataset, leveraging the class weight method instead.